The ability to distinguish between different movie scenes is critical for understanding the storyline of a movie. However, accurately detecting movie scenes is often challenging as it requires the ability to reason over very long movie segments. This is in contrast to most existing video recognition models, which are typically designed for short-range video analysis. This work proposes a State-Space Transformer model that can efficiently capture dependencies in long movie videos for accurate movie scene detection. Our model, dubbed TranS4mer, is built using a novel S4A building block, which combines the strengths of structured state-space sequence (S4) and self-attention (A) layers. Given a sequence of frames divided into movie shots (uninterrupted periods where the camera position does not change), the S4A block first applies self-attention to capture short-range intra-shot dependencies. Afterward, the state-space operation in the S4A block is used to aggregate long-range inter-shot cues. The final TranS4mer model, which can be trained end-to-end, is obtained by stacking multiple S4A blocks one after the other. Our proposed TranS4mer outperforms all prior methods on three movie scene detection datasets (MovieNet, BBC, and OVSD), while also being $2\times$ faster and requiring $3\times$ less GPU memory than standard Transformer models. We will release our code and models.
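The two-stage idea described above (attention within each shot, then a state-space operation across shots) can be illustrated with a minimal sketch. It is not the TranS4mer implementation: it assumes scalar frame features, a single attention head with no learned projections, and a scalar linear state-space recurrence in place of a real S4 layer. The names `self_attention`, `state_space_scan`, and `s4a_like_block` are hypothetical.

```python
import math


def self_attention(xs):
    """Toy single-head self-attention over scalar features (no learned weights)."""
    out = []
    for q in xs:
        scores = [q * k for k in xs]           # dot-product scores for scalars
        m = max(scores)                        # subtract max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, xs)) / z)
    return out


def state_space_scan(us, a=0.9, b=1.0, c=1.0):
    """Scalar linear state-space model: x_t = a*x_{t-1} + b*u_t, y_t = c*x_t."""
    x, ys = 0.0, []
    for u in us:
        x = a * x + b * u
        ys.append(c * x)
    return ys


def s4a_like_block(shots, a=0.9):
    """Assumed structure: intra-shot attention, then an inter-shot state-space scan."""
    # Step 1: attention is restricted to frames of the same shot.
    attended = [self_attention(shot) for shot in shots]
    # Step 2: the recurrence runs over all frames, carrying state across shot boundaries.
    flat = [v for shot in attended for v in shot]
    mixed = state_space_scan(flat, a=a)
    # Regroup the outputs into the original shot layout.
    out, i = [], 0
    for shot in shots:
        out.append(mixed[i:i + len(shot)])
        i += len(shot)
    return out


if __name__ == "__main__":
    # Two hypothetical shots with two frames each.
    print(s4a_like_block([[1.0, 1.0], [2.0, 2.0]], a=0.5))
```

Because attention is applied per shot, its cost grows with the shot length rather than with the full video length, while the recurrence passes information between shots in linear time. Residual connections, normalization, and feed-forward layers are omitted here.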
Vision transformers (ViTs) have achieved impressive results on various computer vision tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained only on visual data, to generalize to audio-visual data without finetuning any of their original parameters. To do so, we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks by injecting a small number of trainable parameters into every layer of a frozen ViT. To efficiently fuse visual and audio cues, our LAVISH adapter uses a small set of latent tokens, which form an attention bottleneck, thus eliminating the quadratic cost of standard cross-attention. Compared to the existing modality-specific audio-visual methods, our approach achieves competitive or even better performance on various audio-visual tasks while using fewer tunable parameters and without relying on costly audio pretraining or external audio encoders. Our code is available at
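To make the cost argument for a latent attention bottleneck concrete, the sketch below counts query-key pairs. It assumes a two-step scheme in which a few latent tokens first attend to one modality and the other modality then attends to the latents; this scheme, the function names, and the token counts are illustrative assumptions, not details taken from the abstract.

```python
def cross_attention_pairs(n_visual, n_audio):
    """Query-key pairs when every visual token attends to every audio token."""
    return n_visual * n_audio


def bottleneck_pairs(n_visual, n_audio, n_latent):
    """Query-key pairs with a latent bottleneck (assumed two-step scheme).

    Step 1: n_latent latent queries attend to n_audio audio tokens.
    Step 2: n_visual visual queries attend to the n_latent latents.
    """
    return n_latent * n_audio + n_visual * n_latent


if __name__ == "__main__":
    # Hypothetical sizes: 196 visual tokens, 196 audio tokens, 8 latent tokens.
    print(cross_attention_pairs(196, 196))
    print(bottleneck_pairs(196, 196, 8))
```

With the number of latents held fixed, the bottleneck count grows linearly in the number of visual and audio tokens, whereas direct cross-attention grows with their product.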
The last several years have witnessed remarkable progress in video-and-language (VidL) understanding. However, most modern VidL approaches use complex and specialized model architectures and sophisticated pretraining protocols, which makes these frameworks difficult to reproduce, analyze, and compare. Hence, instead of proposing yet another new VidL model, this paper conducts a thorough empirical study demystifying the most important factors in VidL model design. Among the factors that we investigate are (i) the spatiotemporal architecture design, (ii) the multimodal fusion schemes, (iii) the pretraining objectives, (iv) the choice of pretraining data, (v) pretraining and finetuning protocols, and (vi) dataset and model scaling. Our empirical study reveals that the most important design factors include temporal modeling, video-to-text multimodal fusion, masked modeling objectives, and joint training on images and videos. Using these empirical insights, we then develop a step-by-step recipe, dubbed VindLU, for effective VidL pretraining. Our final model, trained using our recipe, achieves results comparable to or better than the state of the art on several VidL tasks without relying on external CLIP pretraining. In particular, on the text-to-video retrieval task, our approach obtains 61.2% on DiDeMo and 55.0% on ActivityNet, outperforming the current state of the art by 7.8% and 6.1%, respectively. Furthermore, our model also obtains state-of-the-art video question-answering results on ActivityNet-QA, MSRVTT-QA, MSRVTT-MC and TVQA. Our code and pretrained models are publicly available at:
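This report describes our submission, called "TarHeels", to the Ego4D: Object State Change Classification Challenge. We use a transformer-based video recognition model and leverage the divided space-time attention mechanism to classify object state changes in egocentric videos. Our submission achieves the second-best performance in the challenge. Furthermore, we perform an ablation study to show that identifying object state changes in egocentric videos requires temporal modeling ability. Lastly, we present several positive and negative examples to visualize our model's predictions. The code is publicly available at: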
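We introduce an audiovisual method for long-range text-to-video retrieval. Unlike previous approaches designed for short video retrieval (e.g., 5-15 seconds in duration), our approach aims to retrieve long videos that capture complex human actions. One challenge of standard video-only approaches is the large computational cost associated with processing hundreds of densely extracted frames from such long videos. To address this issue, we propose to replace parts of the video with compact audio cues that succinctly summarize dynamic audio events and are cheap to process. Our method, named ECLIPSE (Efficient CLIP with Sound Encoding), adapts the popular CLIP model to an audiovisual video setting by adding a unified audiovisual transformer block that captures complementary cues from both the video and audio streams. In addition to being 2.92x faster and 2.34x more memory-efficient than long-range video-only approaches, our method also achieves better text-to-video retrieval accuracy on several diverse long-range video datasets such as ActivityNet, QVHighlights, YouCook2, DiDeMo, and Charades.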
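Most modern video recognition models are designed to operate on short video clips (e.g., 5-10 seconds in length). Thus, it is challenging to apply such models to long movie understanding tasks, which typically require sophisticated long-range temporal reasoning. The recently introduced video transformers partially address this issue by using long-range temporal self-attention. However, due to the quadratic cost of self-attention, such models are often costly and impractical to use. Instead, we propose ViS4mer, an efficient long-range video model that combines the strengths of self-attention and the recently introduced structured state-space sequence (S4) layer. Our model uses a standard Transformer encoder for short-range spatiotemporal feature extraction and a multi-scale temporal S4 decoder for subsequent long-range temporal reasoning. By progressively reducing the spatiotemporal feature resolution and channel dimension at each decoder layer, ViS4mer learns complex long-range spatiotemporal dependencies in a video. Furthermore, ViS4mer is $2.63\times$ faster and requires $8\times$ less GPU memory than the corresponding pure self-attention-based model. Additionally, ViS4mer achieves state-of-the-art results in $6$ out of $9$ long-form movie video classification tasks on the Long Video Understanding (LVU) benchmark. Furthermore, we show that our approach successfully generalizes to other domains, achieving competitive results on the Breakfast and COIN procedural activity datasets. The code is publicly available at: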
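One reason state-space layers suit long sequences is that a linear state-space recurrence can equivalently be written as a causal convolution with a precomputed kernel. The sketch below shows this equivalence for a scalar state; real S4 layers use structured state matrices and compute the kernel far more efficiently, so this is only an illustration under simplifying assumptions, and the names `ssm_recurrent`, `ssm_kernel`, and `ssm_convolutional` are hypothetical.

```python
def ssm_recurrent(us, a, b, c):
    """Step-by-step view: x_t = a*x_{t-1} + b*u_t, y_t = c*x_t, with x_{-1} = 0."""
    x, ys = 0.0, []
    for u in us:
        x = a * x + b * u
        ys.append(c * x)
    return ys


def ssm_kernel(length, a, b, c):
    """Convolution kernel of the same model: k_j = c * a**j * b."""
    return [c * (a ** j) * b for j in range(length)]


def ssm_convolutional(us, a, b, c):
    """Convolutional view: y_t = sum_{j=0..t} k_j * u_{t-j} (naive O(L^2) loop)."""
    k = ssm_kernel(len(us), a, b, c)
    return [sum(k[j] * us[t - j] for j in range(t + 1)) for t in range(len(us))]


if __name__ == "__main__":
    inputs = [1.0, 2.0, 3.0]  # hypothetical per-step features
    print(ssm_recurrent(inputs, a=0.5, b=1.0, c=1.0))
    print(ssm_convolutional(inputs, a=0.5, b=1.0, c=1.0))
```

Both functions return the same sequence. The recurrent form costs linear time in the sequence length, and the convolutional form can be parallelized during training, in contrast to the quadratic pairwise comparisons of self-attention.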